
Add replicate-weight variance and PSU-level bootstrap to dCDH#311

Merged
igerber merged 13 commits into main from feature/dcdh-survey-variance-extensions
Apr 18, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 18, 2026

Summary

  • Adds opt-in replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) for dCDH via the library's compute_replicate_if_variance helper, routed through the inline branch in _survey_se_from_group_if (Class A IF sites) and a parallel branch in _compute_heterogeneity_test (Class B). Effective df_survey reduces to min(n_valid) - 1 across IF sites when replicates fail, matching the precedent in efficient_did.py and triple_diff.py.
  • Adds PSU-level Hall-Mammen wild bootstrap via a group_id_to_psu_code dict threaded through _compute_dcdh_bootstrap. Every target's PSU map is now constructed by group-ID lookup (not positional reuse), so future horizon-specific masking cannot silently miscluster bootstrap draws — a size mismatch raises loudly. Under the default auto-inject psu=group, the identity-map fast path preserves pre-change behavior bit-for-bit.
  • Locks replicate_weights + n_bootstrap > 0 as mutually exclusive (replicate variance is closed-form; combining would double-count), matching efficient_did.py:989, staggered.py:1869, two_stage.py:251-253.
  • Warning only fires when PSU is strictly coarser than group (n_psu_eff < n_groups_eff); auto-inject psu=group no longer triggers the stale "PSU-level deferred" warning.
  • Updates REGISTRY.md lines 651-653 to document both mechanisms; twowayfeweights() accepts replicate designs (diagnostic numbers match fit(..., survey_design=sd).twfe_*).
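The mutual-exclusion contract above (replicate variance is closed-form, so layering a bootstrap on top would double-count sampling variability) can be sketched as below. The function name and signature are illustrative, not the library's actual API:

```python
def check_variance_contract(has_replicate_weights: bool, n_bootstrap: int) -> None:
    """Reject replicate variance combined with bootstrap variance."""
    if has_replicate_weights and n_bootstrap > 0:
        raise NotImplementedError(
            "replicate_weights and n_bootstrap > 0 are mutually exclusive: "
            "replicate variance is closed-form; a bootstrap would double-count"
        )

check_variance_contract(True, 0)     # replicate-only fit: OK
check_variance_contract(False, 500)  # bootstrap-only fit: OK
```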

Methodology references (required if estimator / math changes)

  • Method names: de Chaisemartin-D'Haultfœuille (dCDH) survey-design analytical variance (library extension) and Hall-Mammen wild PSU bootstrap (library extension).
  • Paper / source links:
    • de Chaisemartin & D'Haultfœuille (2020) AER + dynamic companion paper — dCDH estimator.
    • Binder (1983) — TSL stratified-PSU variance formula (used by compute_survey_if_variance).
    • Rao-Wu reweighting convention — basis for compute_replicate_if_variance.
    • Hall & Mammen multiplier-bootstrap-at-PSU — basis for the opt-in bootstrap path.
  • Intentional deviations from the source: the dCDH papers assume iid sampling and specify only analytical cohort-recentered plug-in variance; both the group-level IF survey expansion and PSU-level Hall-Mammen bootstrap are documented library extensions in docs/methodology/REGISTRY.md (ChaisemartinDHaultfoeuille Notes on survey expansion + survey+bootstrap contract). R DIDmultiplegtDYN does not natively support either mechanism.
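As a toy illustration of the Rao-Wu-style replicate-combination idea: re-estimate the target under each replicate weight column and combine squared deviations from the full-sample estimate with a method-specific coefficient. The JK1 coefficient (R - 1)/R is shown; BRR, Fay, JKn, and SDR each use their own coefficient, and the helper name here is an assumption for illustration only:

```python
import numpy as np

def jk1_replicate_variance(theta_full: float, theta_reps: np.ndarray) -> float:
    """JK1 combination: (R - 1)/R * sum_r (theta_r - theta_full)^2."""
    R = theta_reps.size
    return (R - 1) / R * float(np.sum((theta_reps - theta_full) ** 2))

reps = np.array([1.0, 1.2, 0.9, 1.1])   # point estimate under each replicate
var = jk1_replicate_variance(reps.mean(), reps)
assert abs(var - 0.0375) < 1e-12        # (3/4) * 0.05
```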

Validation

  • Tests added/updated: tests/test_survey_dcdh_replicate_psu.py (new file, ~40 tests across 4 classes — TestReplicateClassA, TestReplicateClassB, TestPSUBootstrap, TestInvariants). tests/test_survey_dcdh.py updated to reflect the new contract (replicate-only fits now succeed; replicate + n_bootstrap > 0 rejection tested; auto-inject no-warning invariant tested).
  • Backtest / simulation / notebook evidence: N/A — analytical replicate variance + Hall-Mammen wild PSU bootstrap are validated via the unit/integration tests in test_survey_dcdh_replicate_psu.py, including convergence-to-TSL parity for JK1 at n=5000 and end-to-end shared-draw bootstrap under a strictly coarser PSU with L_max=2.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber added 4 commits April 18, 2026 08:27
Extends the group-level Taylor Series Linearization survey support
landed in PR #307 with two opt-in variance mechanisms:

1. Replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) routed
   through the unified compute_replicate_if_variance helper at every
   IF site (overall DID_M, joiners DID_+, leavers DID_-, multi-horizon
   DID_l, placebo DID^pl_l, heterogeneity beta^het_l, twowayfeweights
   helper). Uses inline-branch pattern mirroring EfficientDiD
   (efficient_did.py:1119-1142) and TripleDifference
   (triple_diff.py:1206-1238). Effective df_survey reduces to
   min(n_valid) - 1 across IF sites when replicates fail, matching
   the precedent in efficient_did.py:1133-1135 and
   triple_diff.py:676-686.

2. PSU-level Hall-Mammen wild bootstrap via group_to_psu_map threaded
   through _compute_dcdh_bootstrap. Under auto-inject psu=group the
   identity-map fast path preserves pre-change behavior bit-for-bit.
   Under an explicitly coarser PSU (e.g., psu=state with county-level
   groups), all groups in the same PSU receive the same bootstrap
   multiplier so within-PSU correlation is preserved.

Locked contracts:
- Replicate weights and n_bootstrap > 0 are mutually exclusive
  (replicate variance is closed-form; bootstrap would double-count).
  Raises NotImplementedError, matching efficient_did.py:989 etc.
- Strata participate only through analytical TSL variance, not the
  bootstrap (multiplier weights stay unit-by-unit).
- Warning fires only when n_psu_eff < n_groups_eff (strictly coarser
  PSU). Under auto-inject psu=group the warning is correctly
  suppressed.
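The PSU-level broadcast contract above (all groups in the same PSU get the same bootstrap multiplier) can be sketched with Mammen's two-point distribution, which has mean 0 and variance 1. The function and the exact parametrization are illustrative assumptions, not the library's implementation:

```python
import numpy as np

def psu_mammen_weights(group_psu_codes: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    """One Mammen two-point multiplier per PSU, broadcast to its groups."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0   # golden ratio
    lo, hi = 1.0 - phi, phi            # two-point support: mean 0, variance 1
    p_lo = phi / np.sqrt(5.0)          # P(w = 1 - phi), ~0.7236
    psus, inverse = np.unique(group_psu_codes, return_inverse=True)
    draws = np.where(rng.random(psus.size) < p_lo, lo, hi)
    return draws[inverse]              # broadcast each PSU draw to its groups

rng = np.random.default_rng(0)
codes = np.array([0, 0, 1, 1, 2])      # groups 0-1 share a PSU, 2-3 share another
w = psu_mammen_weights(codes, rng)
assert w[0] == w[1] and w[2] == w[3]   # within-PSU correlation preserved
```

Under auto-inject psu=group, `codes` is a permutation of unique values, so every group draws independently and the path reduces to the plain group-level wild bootstrap.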

Adds 36 regression tests in tests/test_survey_dcdh_replicate_psu.py
covering Class A sites (via _survey_se_from_group_if), Class B sites
(heterogeneity + twowayfeweights), PSU bootstrap semantics, and
cross-cutting invariants (replicate+bootstrap rejection, HonestDiD
under replicate). REGISTRY.md lines 651-653 updated to document both
mechanisms.

Full suite: 525 passing (was 489 before this change).
Addresses local AI review findings:

1. P1: `_slice_psu_map` now enforces strict length equality between
   `target_size` and the full PSU map. Under the current bootstrap
   architecture all targets (overall DID_M, joiners DID_+,
   leavers DID_-, multi-horizon DID_l, placebo DID^pl_l) use the
   same variance-eligible group ordering from `_eligible_group_ids`,
   so the old prefix-slicing was a silent no-op. Making the check
   strict converts a fragile unstated invariant into a loud
   `ValueError` if a future refactor introduces a target whose group
   subset differs from the overall ordering (e.g., an aggregated
   target built from a non-prefix mask) — preventing a silent
   miscluster bug.
2. P2: `fit()` `survey_design` docstring updated to describe:
   - Replicate-weight variance support (BRR/Fay/JK1/JKn/SDR).
   - PSU-level Hall-Mammen wild bootstrap contract.
   - Replicate + `n_bootstrap > 0` rejection.
3. P2: `twowayfeweights()` docstring updated to document replicate
   acceptance.
4. P2: Added 3 regression tests in
   `tests/test_survey_dcdh_replicate_psu.py`:
   - `test_psu_map_size_mismatch_raises`: exercises the new strict
     equality check.
   - `test_generate_psu_or_group_weights_broadcast`: direct unit test
     of PSU-level broadcast — groups in the same PSU receive the same
     multiplier within a bootstrap replicate.
   - `test_generate_psu_or_group_weights_identity`: verifies the
     identity-map fast path is bit-identical to the non-PSU path.

All 321 dCDH-related tests still pass.
Addresses the AI review R2 finding that the prior shared-map +
length-assertion approach still relied on an unstated invariant
(all bootstrap targets use the variance-eligible group ordering).

Changes:
- `_compute_dcdh_bootstrap` now accepts `group_id_to_psu_code: Dict[Any, int]`
  and `eligible_group_ids: np.ndarray` instead of a pre-built flat
  `group_to_psu_map` array. Every target call constructs its PSU
  map via `_map_for_target(target_size, group_id_to_psu_code,
  eligible_group_ids)` — a dict lookup per group ID, not a positional
  reuse. The length-assertion is retained for the size-mismatch
  case and now sits alongside a KeyError-based check for missing
  group IDs.
- `_slice_psu_map` removed; `_map_for_target` replaces it. The
  explicit per-target construction removes the hidden semantic
  assumption that was previously enforced only by cardinality.
- `fit()` builds the dict from `_eligible_group_ids` + the dense PSU
  codes. The per-target maps are all identical today (because every
  current target uses `_eligible_group_ids` ordering), but that is
  now a consequence of the caller's choice rather than a brittle
  assumption buried in the mixin.
- Test `test_psu_map_size_mismatch_raises` replaced by
  `test_map_for_target_id_lookup`, which exercises the ID-lookup
  semantics including:
  - Correct map construction for a given eligible-group ordering.
  - Non-prefix reordering produces a different map (proving the
    ordering-from-IDs contract).
  - Length mismatch → ValueError.
  - Missing-group → ValueError.
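The ID-lookup semantics described above can be sketched as follows. `map_for_target` mirrors the commit's description (strict size check, per-ID dict lookup, loud failure on a missing group); the exact signature is an assumption for illustration:

```python
import numpy as np

def map_for_target(target_size: int,
                   group_id_to_psu_code: dict,
                   eligible_group_ids: np.ndarray) -> np.ndarray:
    """Construct a target's dense PSU map by group-ID lookup, not position."""
    if target_size != eligible_group_ids.size:
        raise ValueError(
            f"PSU map size mismatch: target expects {target_size} groups "
            f"but eligible ordering has {eligible_group_ids.size}"
        )
    try:
        return np.array([group_id_to_psu_code[g]
                         for g in eligible_group_ids.tolist()])
    except KeyError as missing:
        raise ValueError(f"group ID {missing} has no PSU code") from missing

codes = {10: 0, 11: 0, 12: 1}          # e.g., two county groups in one state PSU
ids = np.array([10, 11, 12])
assert map_for_target(3, codes, ids).tolist() == [0, 0, 1]
```

Because the map is rebuilt per target from IDs, reordering `eligible_group_ids` reorders the map with it, which is exactly the ordering-from-IDs contract the replacement test exercises.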

Full regression: 401 passing (previously 321).
AI review R3 flagged the remaining positional truncation in the
multi-horizon shared-draw path:

    w_h = shared_weights[:, : u_h.size]

Under the current contract every horizon's IF vector uses the full
variance-eligible group ordering, so `u_h.size == n_groups_mh` always
holds and the slice is a no-op. But the truncation itself is a hidden
positional assumption — if a future refactor introduced horizon-
specific masking, the shared-weight alignment would break silently.

Replace the truncation with:
1. An explicit assertion `u_h.size == n_groups_mh` that raises with a
   message pointing future refactors at the right fix (threading
   target-specific group IDs through `multi_horizon_inputs`).
2. Direct reuse of the full `shared_weights` matrix — removes the
   slice semantic entirely.
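The two steps above amount to replacing a silent slice with a loud alignment check; a minimal sketch (names mirror the commit message, signature is illustrative):

```python
import numpy as np

def aligned_shared_weights(shared_weights: np.ndarray, u_h: np.ndarray,
                           n_groups_mh: int) -> np.ndarray:
    # Replaces the silent positional truncation
    #     w_h = shared_weights[:, : u_h.size]
    # Under the current contract every horizon's IF vector spans the full
    # variance-eligible group ordering, so a mismatch means a refactor
    # broke the shared-draw alignment and must fail loudly.
    if u_h.size != n_groups_mh:
        raise AssertionError(
            f"horizon IF length {u_h.size} != n_groups_mh {n_groups_mh}; "
            "thread target-specific group IDs through multi_horizon_inputs"
        )
    return shared_weights  # reuse the full matrix; no slice semantics

shared = np.ones((5, 3))   # 5 bootstrap draws x 3 eligible groups
u = np.zeros(3)            # one horizon's IF vector
assert aligned_shared_weights(shared, u, 3) is shared
```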

Add `TestPSUBootstrap.test_multi_horizon_shared_draw_under_coarser_psu`
covering an end-to-end L_max=2 + coarser PSU case that exercises the
shared-draw path + assertion.

Full regression: 312 passing.
@github-actions

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1 on replicate-survey df_survey handling for dCDH inference.

Executive Summary

  • I cross-checked the changed methodology against the dCDH registry entry and the survey-design contracts. The new replicate-weight IF expansion and PSU-level Hall-Mammen bootstrap are documented library extensions, so those deviations from the papers/R are not defects by themselves: docs/methodology/REGISTRY.md#L651.
  • P1: dCDH’s new _effective_df_survey() helper overwrites the design-level replicate df with min(n_valid)-1 instead of reducing the design df when replicate solves fail. That can make top-level dCDH inference use a df that is too large on rank-deficient replicate designs, and the final value is never written back to results.survey_metadata, so downstream HonestDiD can disagree with the main dCDH surface.
  • The new tests only cover clean/full-rank replicate designs (df = R - 1), so they miss both rank-deficient replicate matrices and dropped-replicate paths.
  • I could not execute the test suite in this sandbox because Python test deps are unavailable (pytest and numpy are missing), so this is a static review.

Methodology

Code Quality

No additional findings beyond the P1 df propagation bug above.

Performance

No findings in the changed code.

Maintainability

No separate maintainability finding. Fixing the authoritative df_survey path will remove the main cross-surface inconsistency.

Tech Debt

No existing TODO entry mitigates the P1 above. The dCDH TODO items at TODO.md#L57 cover placebo SE and parity-test scope, not replicate-df propagation.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new tests only assert df_survey == R - 1 on clean replicate designs at tests/test_survey_dcdh_replicate_psu.py#L190 and tests/test_survey_dcdh_replicate_psu.py#L687. They do not exercise either of the two cases the new code is supposed to handle specially: rank-deficient replicate matrices (resolved.df_survey < R-1) or site-level dropped replicates (n_valid < resolved.df_survey + 1). Concrete fix: add one regression with duplicated replicate columns (rank-deficient design) and one with an invalid/dropped replicate column; assert that top-level dCDH inference, results.survey_metadata.df_survey, and HonestDiD all use the same reduced df.
  • Severity P3. Impact: _compute_heterogeneity_test()’s docstring still describes the survey path as Binder TSL only at diff_diff/chaisemartin_dhaultfoeuille.py#L3438, but the implementation now dispatches to replicate variance too at diff_diff/chaisemartin_dhaultfoeuille.py#L3634. Concrete fix: update the docstring to mention replicate-weight variance and the effective-df rule.

Path to Approval

  1. Change _effective_df_survey() so it never exceeds the base replicate design df from ResolvedSurveyDesign.df_survey; it should only reduce that df when n_valid is smaller.
  2. Write the final effective df back into survey_metadata.df_survey before creating ChaisemartinDHaultfoeuilleResults, so honest_did=True, compute_honest_did(results), and exported metadata all consume the same df.
  3. Add regression tests for duplicated replicate columns and dropped/invalid replicate columns, and assert consistent df usage across the main dCDH surface and HonestDiD.

@igerber igerber added and then removed the ready-for-ci (Triggers CI test workflows) label Apr 18, 2026
…ance

Addresses R1 P1 from PR #311 AI review.

The R1 reviewer flagged that `_effective_df_survey()` was overwriting
the design-level `resolved_survey.df_survey` with `min(n_valid) - 1`
whenever any replicate-aware IF site ran. For replicate designs with
rank deficiency, `ResolvedSurveyDesign.df_survey` returns
`QR-rank - 1` (per R's `survey::degf()` convention at
`diff_diff/survey.py:590`), which can be strictly smaller than
`R - 1`. The old behavior produced anti-conservative t-critical
values on the top-level dCDH surface.

Additionally, the final effective df was never persisted into
`survey_metadata.df_survey`, so HonestDiD (which reads from
`results.survey_metadata.df_survey` at `honest_did.py:973`) could
use a different df than the main dCDH surface.

Changes:
- `_effective_df_survey()` now returns `min(resolved_survey.df_survey,
  min(n_valid) - 1)` when both are defined; returns None if base df
  is undefined (rank ≤ 1) or reduced df is degenerate (< 1).
- In `fit()`, immediately before constructing
  `ChaisemartinDHaultfoeuilleResults`, mutate
  `survey_metadata.df_survey` to the final effective df (post-all-
  IF-sites, including heterogeneity). `SurveyMetadata` is a mutable
  `@dataclass` so direct assignment is safe.
- `_compute_heterogeneity_test()` docstring updated to describe the
  replicate-weight dispatch and effective-df rule (R1 P3).

Regression tests in `TestInvariants`:
- `test_rank_deficient_replicate_uses_design_df`: duplicated replicate
  columns → `design_df < R - 1` → persisted `survey_metadata.df_survey`
  must equal `design_df` (not `R - 1`).
- `test_dropped_replicate_reduces_df`: zeroed replicate columns
  produce rank deficiency in the design → persisted df must be
  bounded by `design_df`.
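The capped-df rule described above can be sketched as below; the helper name and return convention mirror the commit message but are illustrative, not the library's exact implementation:

```python
def effective_df_survey(design_df, n_valid_list):
    """Capped effective df: min(design_df, min(n_valid) - 1).

    Returns None when the base design df is undefined (QR rank <= 1)
    or when the reduced df is degenerate (< 1).
    """
    if design_df is None:
        return None               # undefined base df stays undefined
    if not n_valid_list:
        return design_df
    df = min(design_df, min(n_valid_list) - 1)
    return df if df >= 1 else None

assert effective_df_survey(79, [50, 40]) == 39   # site n_valid binds
assert effective_df_survey(12, [50]) == 12       # rank-deficient design df caps
assert effective_df_survey(None, [50]) is None   # rank <= 1: no rescue
assert effective_df_survey(10, [1]) is None      # degenerate reduced df
```

The old behavior corresponded to returning `min(n_valid_list) - 1` unconditionally, which on a rank-deficient design exceeds the design df and yields anti-conservative t-critical values.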

Full regression: 324 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 13d5e41f646d39ef2ad9b6c58c3a9542977f8ef9


Overall Assessment
⛔ Blocker

Highest unmitigated severity: P0 on replicate-survey inference when the base replicate design has undefined df_survey (QR rank <= 1).

Executive Summary

  • The prior re-review finding is only partially fixed: _effective_df_survey() now caps by the design df and the final value is written back to survey_metadata, so the original top-level rank-deficient df_survey bug is improved on the main path. See diff_diff/chaisemartin_dhaultfoeuille.py:L5009-L5040 and diff_diff/chaisemartin_dhaultfoeuille.py:L2697-L2700.
  • P0: dCDH still violates the replicate-df contract when the replicate design’s base df is undefined. _effective_df_survey() returns None, which safe_inference() treats as z-inference, and the new heterogeneity branch can replace an undefined base df with n_valid_het - 1, yielding misleading non-NaN p-values/CIs where the registry requires NaN inference.
  • P1: the new heterogeneity integration still bypasses the base design df and can lower survey_metadata.df_survey only after overall_* / event_study_effects inference has already been computed, leaving inconsistent df usage across result surfaces when heterogeneity has the smallest n_valid.
  • The replicate-weight IF variance and PSU-level Hall-Mammen bootstrap themselves are documented library extensions in the registry, so those changes are not methodology defects. See docs/methodology/REGISTRY.md:L651-L653.
  • I could not execute the tests in this sandbox because pytest and numpy are unavailable.

Methodology

  • Severity P0. Impact: the PR newly makes replicate-weight dCDH reachable, but the undefined-df contract is still wrong. The survey registry says replicate df_survey is QR-rank - 1 and is undefined when rank <= 1, which should yield NaN inference, not z-inference. See docs/methodology/REGISTRY.md:L2665-L2670. In dCDH, _effective_df_survey() returns None for that case diff_diff/chaisemartin_dhaultfoeuille.py:L5009-L5040, while safe_inference() treats df=None as normal/z inference and only returns NaN when df <= 0 diff_diff/utils.py:L152-L186. The main dCDH surfaces pass that None straight through, e.g. diff_diff/chaisemartin_dhaultfoeuille.py:L1864-L1867. The new heterogeneity path is even less safe: it starts from resolved.df_survey, but if that is None it explicitly falls back to n_valid_het - 1 diff_diff/chaisemartin_dhaultfoeuille.py:L3499-L3509 and diff_diff/chaisemartin_dhaultfoeuille.py:L3670-L3689. That can silently produce valid-looking p-values/CIs where the registry requires NaN. Concrete fix: keep exported survey_metadata.df_survey=None if that is the project convention, but introduce an internal dCDH inference-df helper that maps “replicate design + undefined effective df” to 0 before every safe_inference() call; in _compute_heterogeneity_test(), never promote an undefined base df to n_valid_het - 1. Add a regression with a rank-1 replicate matrix that still yields finite replicate SEs and assert NaN t_stat/p_value/conf_int on overall, event-study, and heterogeneity outputs.
  • Severity P1. Impact: the previous top-level fix is not fully carried through the new heterogeneity integration. _compute_heterogeneity_test() re-derives df_s as min(replicate_n_valid_list) - 1 instead of reusing the base-capped _effective_df_survey() diff_diff/chaisemartin_dhaultfoeuille.py:L3499-L3509, so rank-deficient replicate designs can still get a heterogeneity df that exceeds the design df. Separately, the main dCDH inference surfaces are computed before heterogeneity runs, e.g. diff_diff/chaisemartin_dhaultfoeuille.py:L1628-L1630 and diff_diff/chaisemartin_dhaultfoeuille.py:L1864-L1903, but heterogeneity appends to the shared replicate tracker only later diff_diff/chaisemartin_dhaultfoeuille.py:L2620-L2633, after which survey_metadata.df_survey is overwritten with the smaller final value diff_diff/chaisemartin_dhaultfoeuille.py:L2697-L2700. HonestDiD then reads that final metadata df diff_diff/honest_did.py:L970-L978. So when heterogeneity has the smallest n_valid, overall_* / event_study_effects can be anti-conservative relative to heterogeneity_effects, survey_metadata.df_survey, and HonestDiD. Concrete fix: initialize heterogeneity’s starting df from _effective_df_survey(resolved, replicate_n_valid_list) instead of raw min(n_valid)-1, and finalize the shared replicate df before public inference is frozen, or recompute the already-built dCDH inference fields after heterogeneity updates the tracker.
  • Severity P3. Impact: no defect here. The replicate-weight IF expansion and PSU-level Hall-Mammen wild bootstrap are documented deviations from the dCDH papers / R implementation, so I am not treating those changes as source-material mismatches. See docs/methodology/REGISTRY.md:L651-L653. Concrete fix: none.

Code Quality
No additional findings beyond the duplicated / inconsistent df logic already captured above.

Performance
No findings.

Maintainability
No separate maintainability finding beyond the copied df logic in _compute_heterogeneity_test(), which is already the root cause of the P1.

Tech Debt
No separate finding, but the blocking issues above are not mitigated by existing TODO tracking. The dCDH TODO entries currently cover other items only. See TODO.md:L51-L60.

Security
No findings.

Documentation/Tests

  • Severity P2. Impact: the new test coverage does not exercise either blocker above. The heterogeneity replicate test only checks finite beta/se tests/test_survey_dcdh_replicate_psu.py:L319-L337, and the new df regressions cover top-level metadata / HonestDiD but do not activate heterogeneity= or the undefined-df (rank <= 1) case tests/test_survey_dcdh_replicate_psu.py:L715-L800. Concrete fix: add one rank-1 replicate design that still produces finite SEs and assert NaN inference on all public dCDH surfaces, plus one rank-deficient replicate design with heterogeneity= asserting consistent df usage across overall_*, event_study_effects, heterogeneity_effects, survey_metadata.df_survey, and HonestDiD.
  • Severity P3. Impact: the detailed Phase 3 heterogeneity registry note still describes only TSL survey inference docs/methodology/REGISTRY.md:L618-L618, while the implementation and the top-level dCDH survey note now add a replicate-weight branch diff_diff/chaisemartin_dhaultfoeuille.py:L3454-L3471 and docs/methodology/REGISTRY.md:L651-L653. Concrete fix: update the detailed heterogeneity note so the registry has one consistent survey-inference contract.
  • I could not run the added tests here because the sandbox lacks pytest and numpy.

Path to Approval

  1. Add a dCDH-local inference-df helper that converts “replicate design with undefined effective df” into df=0 for safe_inference(), and use it at every dCDH inference callsite.
  2. In _compute_heterogeneity_test(), stop deriving df_s from raw min(replicate_n_valid_list) - 1; use the same base-capped helper as the rest of dCDH, and never replace an undefined base df with n_valid_het - 1.
  3. Make all public dCDH inference surfaces use one final shared replicate df when heterogeneity= is active: either run heterogeneity before the other safe_inference() calls or recompute the already-built public inference fields afterward.
  4. Add regression tests for a rank-1 replicate design (finite SE, NaN inference required) and for a rank-deficient design with heterogeneity= asserting df consistency across results and HonestDiD.

…stency

Addresses R2 P0 + P1 from PR #311 AI review.

R2 P0: `_effective_df_survey` returns None when replicate design has
undefined base df (QR-rank ≤ 1), but `safe_inference(df=None)` uses
z-inference — NOT NaN. So a rank-1 replicate design with finite SE
silently produced valid-looking p-values/CIs where REGISTRY.md
mandates NaN.

R2 P1: `_compute_heterogeneity_test` re-derived its local df_s from
`min(replicate_n_valid_list) - 1` directly — bypassing the base-df
cap the rest of dCDH uses. When df_s was None, the in-loop fallback
promoted it to `n_valid_het - 1`, which rescued a rank-deficient
design in heterogeneity inference that the main surface correctly
treated as NaN. Separately, heterogeneity appended its own n_valid
AFTER main inference had frozen, leaving `overall_t`,
`event_study_effects`, etc. computed with a larger-than-final df while HonestDiD
(reading the post-heterogeneity `survey_metadata.df_survey`) used the
smaller final df.

Changes:
- Add `_inference_df(effective_df, resolved_survey)` helper next to
  `_effective_df_survey`: coerces None to 0 when replicate (triggers
  safe_inference's NaN branch), else returns effective_df as-is.
- Wrap every dCDH `safe_inference(df=...)` call site (~14 sites) with
  `_inference_df(df, resolved_survey)`.
- In `_compute_heterogeneity_test`, derive `df_s` via
  `_effective_df_survey(resolved, list(replicate_n_valid_list or []))`
  instead of the ad-hoc `min - 1`. When `df_s is None` (undefined base
  df), keep `df_s_local = None` — never promote.
- Add a post-heterogeneity finalization block in `fit()` that re-runs
  `safe_inference` for every public surface (overall, joiners,
  leavers, multi-horizon, placebo-horizon, heterogeneity) with the
  FINAL effective df. This guarantees every surface and
  `survey_metadata.df_survey` use the same df.
- Update REGISTRY.md heterogeneity Note to document replicate-weight
  dispatch and the effective-df rule.

Regression tests (TestInvariants):
- `test_rank_1_replicate_forces_nan_inference`: identical replicate
  columns → rank-1 design → design df None → all inference fields NaN.
- `test_heterogeneity_replicate_cross_surface_df_consistency`:
  rank-deficient replicate design with `heterogeneity=` active —
  asserts `overall_p_value` matches `t(final_df)` distribution and
  heterogeneity fields agree.
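The None-to-0 coercion described above can be sketched as below; `inference_df` and the simplified `t_stat_or_nan` stand-in for `safe_inference` are illustrative assumptions:

```python
import math

def inference_df(effective_df, is_replicate_design: bool):
    # df=None falls through to z-inference downstream; for a replicate
    # design an undefined df must instead force the NaN branch (df <= 0).
    if effective_df is None and is_replicate_design:
        return 0
    return effective_df

def t_stat_or_nan(att: float, se: float, df):
    """Simplified stand-in for safe_inference's df handling."""
    if df is not None and df <= 0:
        return math.nan          # undefined df: no valid t-inference
    return att / se              # (z-inference when df is None)

assert math.isnan(t_stat_or_nan(1.0, 0.5, inference_df(None, True)))
assert t_stat_or_nan(1.0, 0.5, inference_df(None, False)) == 2.0
```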

Full regression: 326 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e62d4cf479d3c4aede4dc595791d2b29fa7b8b1


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2

Executive Summary

Methodology

  • Severity P3. Impact: the new registry note says replicate HonestDiD inference uses df_survey = min(n_valid_across_sites) - 1, but the implementation and the detailed heterogeneity note use the capped rule min(resolved_survey.df_survey, min(n_valid) - 1) and keep rank-deficient base designs undefined. That is a documentation mismatch in the project’s methodology source of truth, even though the code path itself looks correct. See docs/methodology/REGISTRY.md#L653, docs/methodology/REGISTRY.md#L618, and diff_diff/chaisemartin_dhaultfoeuille.py#L5109. Concrete fix: update the registry note to the full rule, including the resolved_survey.df_survey cap and the undefined-base-df case.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining items above are not currently tracked in TODO.md.

Security

No findings.

Documentation/Tests

I could not run the tests here because the sandbox is missing numpy and pandas.

Two non-blocker items from PR #311 AI review R3:

- P2: `test_placebo_under_replicate` queried `placebo_event_study[1]`
  but placebo horizons are stored under NEGATIVE keys (REGISTRY.md
  line 546 — `-l` corresponds to backward outcome difference
  `Y_{g, F_g-1-l} - Y_{g, F_g-1}`; storage at
  `chaisemartin_dhaultfoeuille.py:2402` uses `-lag_l` as the key).
  The test's assertion silently skipped (`.get(1)` returned None)
  so the replicate placebo SE path was effectively unverified.
  Fixed to use key `-1` and added explicit assertions on `se`,
  `p_value`, and `conf_int`.

- P3: REGISTRY line 653 described the replicate-effective df as
  `min(n_valid_across_sites) - 1`, but the actual rule (and the
  detailed heterogeneity note at line 618) is the capped form
  `min(resolved_survey.df_survey, min(n_valid) - 1)`. Updated the
  wording to include the design-df cap and the undefined-base-df
  NaN case — matching the code and line 618.
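The negative-key pitfall in the first item can be sketched with a toy dict (values invented). `.get(1)` returns None instead of raising, so a conditional assertion silently skips:

```python
# Placebo horizons are stored under NEGATIVE keys: -l is the backward
# outcome difference Y_{g, F_g-1-l} - Y_{g, F_g-1}.
placebo_event_study = {-1: {"att": 0.02, "se": 0.05},
                       -2: {"att": -0.01, "se": 0.06}}

assert placebo_event_study.get(1) is None     # the old test's silent skip
assert placebo_event_study[-1]["se"] == 0.05  # correct negative-key lookup
```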

Full regression: 316 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b48ee534e3a0934d8767fa9f8c22eb24788ccbd6


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2

Executive Summary

Methodology

  • Severity P3. Impact: the dCDH methodology registry is internally contradictory. The overview in docs/methodology/REGISTRY.md#L466 still says replicate weights and PSU-level bootstrap are deferred, but the supported-contract checklist and notes in docs/methodology/REGISTRY.md#L651 through docs/methodology/REGISTRY.md#L653 now say they are implemented. Because REGISTRY.md is the methodology source of truth, this creates avoidable confusion about whether the PR is a documented extension or an undocumented deviation. Concrete fix: update the overview sentence at line 466 to match the new contract, and keep only the still-deferred single-period placebo-SE limitation called out separately.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find these items tracked in TODO.md under “Tech Debt from Code Reviews.”

Security

  • No findings.

Documentation/Tests

Addresses PR #311 AI review R4 (1 × P2 + 2 × P3).

P2: The cross-surface df consistency test used a duplicated-replicate
design where `resolved.df_survey` binds as the cap BEFORE
heterogeneity contributes. That scenario can't detect a broken
late-propagation path. Added a monkeypatch-based regression
(`test_heterogeneity_late_nvalid_propagates_to_all_surfaces`) that:

1. First counts the main-surface calls to
   `compute_replicate_if_variance` (no heterogeneity).
2. Then runs fit with `heterogeneity=` active and a patched helper
   that returns a SMALLER n_valid ONLY on the heterogeneity call
   (call index > main_count), leaving main surfaces at full n_valid.
3. Asserts every public surface (`overall_p_value`,
   `event_study_effects`, `placebo_event_study`,
   `heterogeneity_effects`, `survey_metadata.df_survey`) uses the
   reduced df via t(reduced_n_valid - 1) distribution.

The test caught two real bugs that the duplicated-column test
missed:

- `effective_overall_t/p/ci` are set by copying from the raw
  `overall_t/p/ci` at line ~2331 (before heterogeneity runs) and
  are used as the results-constructor inputs at line ~2776.
  The R2 recompute block only reassigned `overall_*` — so
  `effective_overall_*` (what actually shipped) stayed on the
  pre-heterogeneity df. Fix: recompute `effective_overall_*` from
  `effective_overall_att/se` (covering both the normal and
  cost-benefit delta paths), and keep `overall_*` in sync for
  downstream consumers.
- `placebo_event_study_dict[-lag]` holds VALUE copies of
  `placebo_horizon_inference[lag]`, so mutating the inner dict in
  the recompute block does NOT propagate to the public surface.
  Fix: explicitly update `placebo_event_study_dict[-lag]` alongside
  the mutation in the recompute loop.
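The second bug is the classic value-copy pitfall; a toy sketch (dict names mirror the commit message, values invented) shows why mutating the inner dict alone never reaches the public surface:

```python
placebo_horizon_inference = {1: {"p_value": 0.04}}
# Public surface holds a VALUE copy, not a reference:
placebo_event_study_dict = {-1: dict(placebo_horizon_inference[1])}

placebo_horizon_inference[1]["p_value"] = 0.10          # recompute mutates inner dict
assert placebo_event_study_dict[-1]["p_value"] == 0.04  # public surface is stale

placebo_event_study_dict[-1].update(placebo_horizon_inference[1])  # explicit fix
assert placebo_event_study_dict[-1]["p_value"] == 0.10
```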

P3a: REGISTRY overview line 466 still said "Replicate weights and
PSU-level bootstrap are deferred" — contradicting the updated
checklist at lines 651-653. Rewrote to match the new contract.

P3b: `diff_diff/guides/llms-full.txt:325` still said
"Replicate weights and PSU-level bootstrap deferred". Updated to
reflect replicate support + opt-in PSU-level Hall-Mammen bootstrap
+ the mutual-exclusion contract with `n_bootstrap > 0`.

Full regression: 327 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bcc14c41e18ac4a09ce36f87e044857c7e829250


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0

Executive Summary

  • Prior re-review doc drift on dCDH survey support appears resolved: the methodology registry now documents replicate-weight variance and PSU-level Hall-Mammen bootstrap as supported library extensions in docs/methodology/REGISTRY.md:L466 and docs/methodology/REGISTRY.md:L651.
  • I do not see an undocumented methodology deviation in the new replicate-weight variance or PSU-level bootstrap mechanics themselves; the main issue is an inference-propagation bug on a public results surface.
  • Severity P0: the new late-df_survey recompute path updates overall/event-study/placebo/heterogeneity outputs, but it does not update results.normalized_effects. When heterogeneity= contributes the binding replicate n_valid, normalized-effects inference can disagree with the final survey_metadata.df_survey and silently report wrong p_value/conf_int (and potentially miss the all-NaN gate when the final replicate df becomes undefined).
  • The new regression test that claims to cover “every public surface” omits normalized_effects, which is why this bug can pass the new suite.
  • I could not execute pytest here because pytest is not installed in this environment.

Methodology

Affected methods checked: dCDH survey IF expansion / replicate-weight analytical variance, heterogeneity under replicate weights, and PSU-level Hall-Mammen wild bootstrap. The new extensions are documented in docs/methodology/REGISTRY.md:L466, docs/methodology/REGISTRY.md:L618, and docs/methodology/REGISTRY.md:L651. I did not find an undocumented paper/registry mismatch in those mechanics.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The blocker above is not mitigated by tracking; I did not find a matching entry under TODO.md:L51.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new late-n_valid regression explicitly says it covers “every public surface” but only asserts overall, event-study, placebo, heterogeneity, and survey_metadata.df_survey in tests/test_survey_dcdh_replicate_psu.py:L922 and tests/test_survey_dcdh_replicate_psu.py:L936. normalized_effects is also a public surface, including export via diff_diff/chaisemartin_dhaultfoeuille_results.py:L1216. That gap is why the P0 above can slip through. Concrete fix: extend test_heterogeneity_late_nvalid_propagates_to_all_surfaces() to assert res.normalized_effects[h]["p_value"] and conf_int use the final reduced df; add a second case with reduced_n_valid=1 to verify the all-NaN branch.
  • I could not run the test suite here because pytest is unavailable.

Path to Approval

  1. Update the final replicate-df propagation block in diff_diff/chaisemartin_dhaultfoeuille.py:L2694 so normalized_effects_out is recomputed with _final_inf_df, or defer building normalized-effects inference until after _final_eff_df is known.
  2. Extend tests/test_survey_dcdh_replicate_psu.py:L922 to cover results.normalized_effects, including a case where late heterogeneity reduces the final df and a case where it makes the final replicate df undefined (n_valid < 2).

…branch

Addresses PR #311 AI review R5 P0 + P2.

P0: The R2 post-heterogeneity finalization block updated overall,
joiners, leavers, multi-horizon event-study, placebo-horizon, and
heterogeneity surfaces, but missed `normalized_effects_out` — which
is built at lines 2472-2487 from a delta-method SE (`SE(DID_l) /
delta^D_l`) using the pre-heterogeneity `_df_survey`. Under a
late-arriving replicate `n_valid` from heterogeneity, the public
`results.normalized_effects` dict shipped stale p-values/CIs and
could miss the all-NaN gate when the final replicate df turned
undefined.

Fix: extend the finalization block to also recompute every
`normalized_effects_out[h]` entry with `_final_inf_df`.

P2: Extended `test_heterogeneity_late_nvalid_propagates_to_all_surfaces`
to assert `res.normalized_effects` p-values match the `t(final_df)`
distribution. Added a new case
`test_late_nvalid_below_two_forces_all_surfaces_nan` that drives
the final `n_valid` to 1 (reduced df = 0 → None via `_inference_df`
coercion) and verifies EVERY public surface (overall,
event_study_effects, placebo_event_study, heterogeneity_effects,
normalized_effects) flips to NaN — not silently keeps the finite
values computed with the larger intermediate df.

Full regression: 328 passing.
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: abdfea4630e86fd7079870ee3602696871d2068f


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

Code Quality

  • Severity P3. Impact: the new warning gate uses all positive-weight raw groups, while the actual PSU bootstrap map is built from post-filter eligible_group_ids; if upstream dCDH filtering drops groups, the code can still emit a warning even when the effective bootstrap target is one group per PSU. Concrete fix: compute the warning counts from eligible_group_ids_bootstrap / group_id_to_psu_code_bootstrap (or from the same post-filter mask used to build them), and add a regression with dropped groups sharing PSUs. See diff_diff/chaisemartin_dhaultfoeuille.py:L1973-L1990 and diff_diff/chaisemartin_dhaultfoeuille.py:L2145-L2165.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find a matching tracked limitation under the “Tech Debt from Code Reviews” section of TODO.md:L51.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: two in-code docstrings still say dCDH’s bootstrap is clustered at the group level only, which no longer matches the implemented survey PSU path and can mislead future readers about the supported contract. Concrete fix: update both docstrings to say “group-level by default; PSU-level under survey_design when PSUs are coarser than groups,” and point readers to the registry note. See diff_diff/chaisemartin_dhaultfoeuille.py:L302-L310 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L5-L10.
  • No additional test-gap findings in the re-review scope. The new regression coverage now exercises the previous normalized_effects gap and the final-NaN path in tests/test_survey_dcdh_replicate_psu.py:L996-L1166. I could not execute the suite in this environment.

Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups,
not the post-filter eligible-group set used to build the bootstrap
PSU map. Panels where upstream dCDH filtering drops groups that
share PSUs with kept groups could emit a misleading "PSU coarser
than group" warning even when the effective bootstrap is one group
per PSU.

Fix: count PSUs and groups from `_eligible_group_ids` (the same set
feeding `group_id_to_psu_code_bootstrap`), preserving the within-
group-constant-PSU invariant by taking each eligible group's first
positive-weight PSU label.
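The corrected gate can be sketched as follows: count PSUs and groups from the post-filter eligible set rather than from all raw positive-weight rows. All names here are illustrative, not the library's internals:

```python
# Sketch of the corrected warning gate: the counts come from the same
# post-filter eligible set that feeds the bootstrap PSU map, so dropped
# groups that share PSUs with kept groups no longer trigger a warning.
def psu_warning_needed(eligible_group_ids, group_to_psu):
    groups = set(eligible_group_ids)
    # One PSU label per group (within-group-constant-PSU invariant).
    psus = {group_to_psu[g] for g in groups}
    # Warn only when PSUs are strictly coarser than groups.
    return len(psus) < len(groups)

group_to_psu = {"g1": "p1", "g2": "p1", "g3": "p2"}
# Raw panel: g1 and g2 share PSU p1 -> coarser, warning fires.
print(psu_warning_needed(["g1", "g2", "g3"], group_to_psu))  # True
# Post-filter: g2 dropped -> one group per PSU -> no warning.
print(psu_warning_needed(["g1", "g3"], group_to_psu))        # False
```

The second call is the regression scenario described above: dropped groups sharing PSUs with kept groups must not produce a misleading "PSU coarser than group" warning.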

P3 #2: Two docstrings said the bootstrap is "clustered at the group
level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring:
  extended to note PSU-level Hall-Mammen wild clustering under
  `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module
  docstring: documents the identity-map fast path (auto-inject
  `psu=group`), the PSU-level broadcast when PSU is strictly
  coarser, and points to REGISTRY.md for the full contract.

Full regression: 318 passing.
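The PSU-level broadcast documented in the updated docstrings can be sketched as: draw one Hall-Mammen (Mammen 1993) weight per PSU and broadcast it to every group inside that PSU, so groups sharing a PSU perturb together. This is a minimal sketch, not the library's implementation:

```python
import numpy as np

def mammen_weights(n_psu, rng):
    # Mammen (1993) two-point distribution: mean 0, variance 1,
    # third moment 1 (the Hall-Mammen wild bootstrap weights).
    sqrt5 = np.sqrt(5.0)
    lo, hi = -(sqrt5 - 1) / 2, (sqrt5 + 1) / 2
    p_lo = (sqrt5 + 1) / (2 * sqrt5)
    return np.where(rng.random(n_psu) < p_lo, lo, hi)

def psu_broadcast_weights(group_psu_codes, rng):
    # One weight per unique PSU, broadcast back to groups via the
    # inverse index, clustering the perturbation at the PSU level.
    codes, inverse = np.unique(group_psu_codes, return_inverse=True)
    return mammen_weights(len(codes), rng)[inverse]

rng = np.random.default_rng(0)
w = psu_broadcast_weights(np.array([0, 0, 1, 2, 2]), rng)
# Groups 0,1 share PSU 0; groups 3,4 share PSU 2: identical weights.
assert w[0] == w[1] and w[3] == w[4]
```

When PSU equals group (the auto-inject default), the inverse index is the identity, which is exactly the fast path that preserves pre-change behavior.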
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 41474069e87331a1019a4177666ca4cb52b80a9f


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new untracked findings. I did not see a silent-correctness issue being deferred, and I did not find a matching blocker under TODO.md:51.

Security

  • No findings.

Documentation/Tests

Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the
group level" — which was true when the PR was written but is now
incomplete given the PSU-level Hall-Mammen wild bootstrap path
under `survey_design`. Updated to distinguish user-specified
`cluster=` (still unsupported, raises NotImplementedError) from
automatic PSU-level clustering (takes over under `survey_design`
with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) —
  rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the
  automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor
  docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503`
  `cluster=` error messages — removed "Phase 1" phrasing, added
  PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated
  to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will
add multiplier-bootstrap support for placebo and bootstrap covers
`DID_M`, `DID_+`, `DID_-` only — both stale after this PR's
L_max >= 1 placebo and event-study bootstrap paths. Rewrote to
scope the NaN-SE contract to `L_max=None` only and describe the
full bootstrap coverage (overall, joiners, leavers, per-horizon
event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
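The shared-weights sup-t bands mentioned above follow a standard construction: reuse the same bootstrap draws across horizons, take the bootstrap distribution of the max absolute t-statistic across horizons, and invert its quantile. A minimal sketch under those assumptions:

```python
import numpy as np

def supt_band(est, se, boot_est, alpha=0.05):
    # est, se: (H,) per-horizon point estimates and standard errors.
    # boot_est: (B, H) bootstrap estimates from SHARED weight draws,
    # so cross-horizon dependence survives into the max statistic.
    t_star = np.abs(boot_est - est) / se          # (B, H)
    crit = np.quantile(t_star.max(axis=1), 1 - alpha)
    return est - crit * se, est + crit * se

rng = np.random.default_rng(1)
est = np.array([0.5, 0.8, 1.1])
se = np.array([0.2, 0.2, 0.3])
# Toy bootstrap draws standing in for shared-weight replicates.
boot = est + rng.standard_normal((999, 3)) * se
lo, hi = supt_band(est, se, boot)
```

Because the critical value is taken from the max over horizons, the band covers all horizons simultaneously at the nominal level, unlike pointwise intervals built from each horizon's own quantile.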
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 9368ae65b89c347e469c9d711607e50acf434a5a


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

No findings. Affected methods are the dCDH survey IF/replicate variance path on the main ATT, event-study, and placebo surfaces diff_diff/chaisemartin_dhaultfoeuille.py:L1718-L1768, diff_diff/chaisemartin_dhaultfoeuille.py:L1836-L1909, the survey heterogeneity path diff_diff/chaisemartin_dhaultfoeuille.py:L3540-L3818, PSU-level Hall-Mammen bootstrap diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L68-L399, and survey-backed twowayfeweights() aggregation diff_diff/chaisemartin_dhaultfoeuille.py:L5604-L5694. These are explicitly documented as library extensions in docs/methodology/REGISTRY.md:L618-L618 and docs/methodology/REGISTRY.md:L651-L653, and the code routes through the shared survey helpers consistently.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a silent-correctness issue being deferred via TODO.md:L51-L83.

Security

No findings.

Documentation/Tests

Absorbs #312 (sdid scale fix), #313 (roadmap refresh + dCDH docstring
rewrite), and #314 (within_transform convergence warnings) from main.

Conflicts resolved in:
- diff_diff/chaisemartin_dhaultfoeuille.py: took main's comprehensive
  Phase 1-3 feature list in the class docstring but merged in the
  PR #311 group-vs-PSU bootstrap-clustering framing and the
  replicate-weight survey-support line. Kept the PR #311
  'user-specified cluster= not supported + automatic PSU-level under
  survey_design' wording for the cluster= parameter docstring (strictly
  more accurate than main's 'always clusters at the group level' text).
- diff_diff/guides/llms-full.txt: kept main's more detailed placebo SE
  contract paragraph (which already distinguishes single-period NaN
  from multi-horizon analytical/bootstrap) and appended the sup-t /
  shared-weights / cross-horizon coverage details from the PR #311
  update. Kept the PR #311 survey_design signature comment that
  mentions TSL + replicate + PSU bootstrap.

Full regression across touched areas: 336 + 324 passing.
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 0e537b0caefb85db7ff525ebf0d01e11ab70546e


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 18, 2026
@igerber igerber merged commit 76fb12c into main Apr 18, 2026
19 checks passed
@igerber igerber deleted the feature/dcdh-survey-variance-extensions branch April 18, 2026 19:10
igerber added a commit that referenced this pull request Apr 18, 2026
…odo config

Packages five merged PRs since v3.1.2 as patch release 3.1.3:

- #311 Replicate-weight variance and PSU-level bootstrap for dCDH — new
  variance_method="replicate" (BRR / Fay / JK1 / JKn / SDR) and PSU-level
  multiplier bootstrap, with df-aware inference and group-level PSU map.
- #321 Zenodo DOI auto-minting config — .zenodo.json + top-level LICENSE
  so the next GitHub Release mints a concept + versioned DOI automatically.
- #319 Silent sparse->dense lstsq fallback signaling in ImputationDiD and
  TwoStageDiD — emits ConvergenceWarning instead of switching paths silently.
- #317 Non-convergence signaling in TROP alternating-minimization solvers,
  including LOOCV and bootstrap aggregation. Top-level warning aggregation.
- #320 /bump-version skill now updates CITATION.cff; single RELEASE_DATE
  resolved upfront and threaded through all date-bearing files.

Version strings bumped in diff_diff/__init__.py, pyproject.toml,
rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff
(version: 3.1.3, date-released: 2026-04-18). CHANGELOG populated with
Added / Fixed / Changed sections and comparison-link footer.

Per project SemVer convention, minor bumps are reserved for new estimators
or new module-level API; additive extensions to existing estimators (like
PR #311's new variance_method values) are patch-level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>